News headlines serve as flagships for the events that dictate the rise and fall of equities, as the stock market is held under the pervasive influence of media. This raises the question: how can we use the availability of news information to methodically quantify the effect of media on a company’s stock price? Using data freely available from the GDELT Project and Yahoo Finance, this study leveraged Amazon Web Services as a platform for PySpark, which the authors used to combine company stock price information with an aggregated media tone drawn from reports across different news sources. The closing price of several companies was then modelled using the Merlion time series DefaultForecaster with an 80% train and 20% test split. Tesla stock produced an MAE of 2.47 when the cumulative sum of the daily media tone was included as a feature and forecasted using the last value of a 3-day sliding prediction window. When backtesting is performed with the model’s predictions, the study produces a return of -9.64 compared to -61.67 when using moving averages alone, a definite improvement. To further refine these results, the authors suggest the following: augmenting the keyword choices when scraping web articles from the GDELT Project, granular incorporation of long- and short-range data also coming from GDELT, additional pre-processing steps and hyperparameter tuning for the time series models used, and finally, company profiling to refine the selection criteria of entities suited to this methodology.
from datetime import datetime
import math
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly
from IPython.display import HTML
from IPython.display import Image as _image
import pickle
from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import SMA
plotly.offline.init_notebook_mode()
COLORS = ['#4db4d7']
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<style>
.output_png {
display: table-cell;
text-align: center;
horizontal-align: middle;
vertical-align: middle;
margin:auto;
}
tbody, thead {
margin-left:100px;
}
</style>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>''')
How might we measure the effect of media sentiments on stock prices?
Systematic risk represents the volatility and dynamics of the stock market driven by factors outside the companies involved, such as government policy and mass media. It is unpredictable and impossible to avoid completely. Therefore, there is benefit in being able to incorporate it into models used to predict stock prices. [3]
Mass media acts as a representation of systematic risk, as its influence is pervasive throughout society and its institutions, the stock market included. Our study examines how much influence mass media has in forecasting the closing price of a company's stock. For example, continuous negative press on a company can drive its stock price down throughout that period. Being able to detect the early signs of the downtrend can mean selling early and minimizing the net loss.
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 1. Portfolio Risk.
</b></center>'''))
display(_image(filename="img/risk.png"))
The GDELT Project has almost 250 million event records covering a worldwide range of categories from 1979 up to the present. It is a network connecting each record’s themes, locations, personalities, and organizations with corresponding emotional assessments. It is meant to be a repository and representation of the world’s behavior recorded through events. It covers most world events and provides details such as the people or groups involved, the context, and the tone the record reflects. Different media, such as broadcast, print, and web news released by several media sources worldwide, are recorded [1].
The dataset used for this study is the events dataset, which has over 143,000 files and is 102 GB in size. The events dataset is a table updated every 15 minutes; the copy in the AWS Open Data Registry covers 2015 to Q1 2019. The GDELT events dataset has 61 features in total, but we chose the following features for the final dataset:
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 2. Initial Data Size Scan.
</b></center>'''))
display(_image(filename="img/datasize.png"))
The average tone serves as the key feature for forecasting a company’s stock price in this study. It is the average “tone” of all documents containing one or more mentions of an event. Scores range from -100 (extremely negative) to +100 (extremely positive), though common values fall between -10 and +10, with 0 indicating neutral. It is used for filtering the “context” of events, as a subtle measure of the importance of an event, and as a proxy for the “impact” of that event, though it provides only a basic tonal assessment of an article. [2]
For the corresponding stock information of each company analyzed, we utilized Yahoo Finance, one of the largest sources of stock quotes and other financial information available. It holds information on over 37 thousand stocks across over 50 countries.
| Feature Name | Description |
|---|---|
| Date Added | stores the date the event was added to the master database |
| Actor1Name | first of two actors involved in the event. It could be proper or formal names, names of countries and cities, ethnic groups, religious groups, etc |
| Actor2Name | second of two actors involved in the event. It could be proper or formal names, names of countries and cities, ethnic groups, religious groups, etc |
| URL | the URL or citation of the news report the event was found in |
| Average Tone | a quantitative measure of the positive, negative, or neutral sentiment for the event record |
Let us visualize the average tone score and cumulative average tone score of the companies for each year to see the trend of sentiments in the written articles.
tesla_tone = pd.read_csv('data/tesla.csv')
fb_tone = pd.read_csv('data/facebook.csv')
nf_tone = pd.read_csv('data/netflix.csv')
amzn_tone = pd.read_csv('data/amazon.csv')
for tone in [tesla_tone, fb_tone, nf_tone, amzn_tone]:
    # Parse GDELT's DATEADDED field (YYYYMMDD) into a datetime column.
    tone['Date'] = (tone['DATEADDED'].astype('str')
                    .apply(lambda x: datetime(int(x[:4]),
                                              int(x[4:6]),
                                              int(x[6:]))))
    tone.drop('DATEADDED', axis=1, inplace=True)
    tone['year'] = tone.Date.dt.year.astype('int')
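The per-row lambda above can equivalently be written with pandas' vectorized datetime parsing; a minimal sketch, using a stand-in frame with the same `DATEADDED` layout:

```python
import pandas as pd

# Stand-in frame with GDELT's DATEADDED layout (YYYYMMDD as an integer).
df = pd.DataFrame({'DATEADDED': [20150102, 20181130],
                   'Daily Average Tone': [1.2, -3.4]})

# Vectorized parse of YYYYMMDD, replacing the per-row lambda.
df['Date'] = pd.to_datetime(df['DATEADDED'].astype(str), format='%Y%m%d')
df = df.drop(columns='DATEADDED')
df['year'] = df['Date'].dt.year
```

This avoids constructing a `datetime` per row in Python and scales better to large tone files.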
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 3. Plots of time series tone score.
</b></center>'''))
display(HTML(f'''<h3 style="text-align:center">
Average Yearly<b style="color:{COLORS[0]}">
tone score</b> of the companies
</h3>'''))
fig, ax = plt.subplots(figsize=(10, 6))
names = ['tesla', 'facebook', 'netflix', 'amazon']
colors = ['#212121', '#3b5998', '#e50914', '#e47911']
for tone, name, color in zip([tesla_tone, fb_tone, nf_tone, amzn_tone],
                             names, colors):
    (tone.groupby('year')['Daily Average Tone'].mean()
     .plot(color=color, lw=4, ax=ax, label=name))
plt.xlabel("Year", fontsize=12)
plt.ylabel("Average Tone Score", fontsize=12)
ax.legend()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
tesla_cumsum_tone = pd.read_csv('data/tesla_cumsum.csv')
fb_cumsum_tone = pd.read_csv('data/facebook_cumsum.csv')
nf_cumsum_tone = pd.read_csv('data/netflix_cumsum.csv')
amzn_cumsum_tone = pd.read_csv('data/amazon_cumsum.csv')
for tone in [tesla_cumsum_tone, fb_cumsum_tone,
             nf_cumsum_tone, amzn_cumsum_tone]:
    # Parse GDELT's DATEADDED field (YYYYMMDD) into a datetime column.
    tone['Date'] = (tone['DATEADDED'].astype('str')
                    .apply(lambda x: datetime(int(x[:4]),
                                              int(x[4:6]),
                                              int(x[6:]))))
    tone.drop('DATEADDED', axis=1, inplace=True)
    tone['year'] = tone.Date.dt.year.astype('int')
display(HTML(f'''<h3 style="text-align:center">
Average yearly<b style="color:{COLORS[0]}">
cumulative tone score</b> of the companies
</h3>'''))
fig, ax = plt.subplots(figsize=(10, 6))
names = ['tesla', 'facebook', 'netflix', 'amazon']
colors = ['#212121', '#3b5998', '#e50914', '#e47911']
for tone, name, color in zip([tesla_cumsum_tone, fb_cumsum_tone,
                              nf_cumsum_tone, amzn_cumsum_tone],
                             names, colors):
    (tone.groupby('year')['Daily Average Tone'].mean()
     .plot(color=color, lw=4, ax=ax, label=name))
plt.xlabel("Year", fontsize=12)
plt.ylabel("Average Cumulative Tone Score", fontsize=12)
ax.legend()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
We can observe that Tesla has consistently been at the top of the tone score, except for 2018, when there was news about its CEO Elon Musk facing fraud charges and about a Tesla car crash while on Autopilot. Aside from that year, it remained at the top of the tone scores. We can also see that Facebook is far behind at the bottom in terms of tone score. Overall, most of these big companies have a negative tone score in news articles.
Let us also visualize the actual performance of stocks during the same period of study from 2015 to 2019.
def show_candlesticks(df, symbols='All'):
    """
    Show a candlestick chart of a market.

    Parameters
    ==========
    df : DataFrame
        dataframe holding Open, High, Low, and Close columns
    symbols : str
        symbol to use as the subplot title
    """
    # create figure object for subplots
    fig = make_subplots(rows=1,
                        cols=1,
                        subplot_titles=[symbols])
    # update figure size
    fig.update_layout(
        autosize=False,
        width=950,
        height=500,
        paper_bgcolor="LightSteelBlue"
    )
    row = 1
    col = 1
    df_market = df.copy()
    fig.add_trace(
        go.Candlestick(x=df.index,
                       open=df_market['Open'],
                       high=df_market['High'],
                       low=df_market['Low'],
                       close=df_market['Close']),
        row=row, col=col
    )
    fig.show()
tesla = pd.read_csv('data/tesla_complete.csv').set_index('Date')
fb = pd.read_csv('data/facebook_completed.csv').set_index('Date')
nf = pd.read_csv('data/netflix_completed.csv').set_index('Date')
amzn = pd.read_csv('data/amazon_completed.csv').set_index('Date')
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 4. Plot of time series actual stock prices.
</b></center>'''))
for stock, title in zip([tesla, fb, nf, amzn],
['Tesla', 'Facebook', 'Netflix', 'Amazon']):
show_candlesticks(stock, title)
Most of the stock prices of the chosen companies increased during this period. Relating this to the tone scores we obtained earlier, prices appear to follow a similar trend. For example, Tesla's stock price started to get rocky from 2018 onwards, matching the tone score the company received. Similarly, Netflix and Amazon experienced a drop in 2019, mirroring the results of the tone score. There appears to be some correlation between the two variables.
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 5. Methodology of the Project.
</b></center>'''))
display(_image(filename="img/methodology.png"))
The GDELT dataset is publicly available in the Open Data Registry from AWS. The events dataset consists of CSV files corresponding to 15-minute intervals of event recording. A 4-instance EMR cluster was prepared for this study, with 1 master and 3 core instances, using the latest EMR release (6.5.0). The machines are all m5.xlarge nodes with 4 cores and 16 GB memory each, plus 64 GB of EBS storage. The EBS root volume size was set to 100 GB.
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 6. AWS Cluster Configuration.
</b></center>'''))
display(_image(filename="img/cluster.PNG"))
Data preprocessing was done by creating several queries involving the chosen features. Several iterations were run for four companies: Tesla, Facebook, Netflix, and Amazon. The company name was used as the search term or keyword and was checked against the actor names and the URL. After filtering with the keyword, the date added and the corresponding average tone were extracted for the study. Files were saved for each company, listing each date and its corresponding tone score for the entire range of 2015 to 2019. For reference, please see the GDELT download supplementary notebook.
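The keyword filter described above can be sketched in pandas as follows; the actual pipeline ran in PySpark on EMR, and the toy data, column names, and case-insensitive matching here are simplifying assumptions:

```python
import pandas as pd

def filter_by_company(events: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Keep events mentioning `keyword` in either actor name or the URL."""
    kw = keyword.upper()
    mask = (events['Actor1Name'].fillna('').str.upper().str.contains(kw)
            | events['Actor2Name'].fillna('').str.upper().str.contains(kw)
            | events['URL'].fillna('').str.upper().str.contains(kw))
    return events.loc[mask, ['DATEADDED', 'AvgTone']]

# Toy events table mimicking the chosen GDELT features.
events = pd.DataFrame({
    'DATEADDED': [20180101, 20180101, 20180102],
    'Actor1Name': ['TESLA', 'UNITED STATES', None],
    'Actor2Name': [None, 'NETFLIX', 'TESLA MOTORS'],
    'URL': ['https://example.com/tesla-news',
            'https://example.com/other',
            'https://example.com/ev'],
    'AvgTone': [2.5, -1.0, -3.2],
})
tesla_events = filter_by_company(events, 'tesla')
```

The same mask translates directly to PySpark column expressions for the full 102 GB dataset.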
Two types of average tone data were extracted from the main GDELT database. The first was the mean daily tone, a straightforward average of all tone scores per day. The second was a cumulative daily tone: a running two-interval cumulative score computed for every available 15-minute update, then averaged per day. This was done to get a feel for how the tone evolves throughout the day.
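One way to read this computation (a sketch under our interpretation; the exact windowing lives in the GDELT download supplementary notebook) is a two-interval rolling sum over the 15-minute tone updates within each day, averaged per date:

```python
import pandas as pd

# Hypothetical 15-minute tone updates across two days.
updates = pd.DataFrame({
    'timestamp': pd.to_datetime(['2018-01-01 09:00', '2018-01-01 09:15',
                                 '2018-01-01 09:30', '2018-01-02 09:00',
                                 '2018-01-02 09:15']),
    'tone': [1.0, -2.0, 3.0, 0.5, 0.5],
})

day = updates['timestamp'].dt.date
# Rolling sum over two consecutive intervals, computed within each day...
updates['cum2'] = (updates.groupby(day)['tone']
                   .transform(lambda s: s.rolling(2).sum()))
# ...then averaged per day to yield one cumulative tone value per date.
daily = updates.groupby(day)['cum2'].mean()
```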
display(HTML('''<p style="font-size:12px;font-style:default;"><b>
Table 1. Sample output of the tone score.
</b></p>'''))
tesla_tone = pd.read_csv('data/tesla.csv')
display(tesla_tone.head())
Yahoo Finance is one of the largest sources of stock quotes and other financial information, updated every day. The team performed web scraping using the Yahoo Finance API. We scraped the historical stock prices, containing the high, low, volume, open, and close of each day, for the four companies we wanted to explore: Facebook, Tesla, Netflix, and Amazon.
display(HTML('''<p style="font-size:12px;font-style:default;"><b>
Table 2. Snippet of the output of yahoo finance.
</b></p>'''))
tesla = pd.read_csv('data/tesla_complete.csv')
display(tesla.iloc[:, :6].head())
After web scraping, we computed the technical indicators for each company, such as MACD, Bollinger bands, RSI, and the moving-averages crossover. These serve as a baseline for creating a trading strategy and can also work in conjunction with the model we developed. We then combined them with the average and cumulative tone scores for each day scraped from GDELT.
display(HTML('''<p style="font-size:12px;font-style:default;"><b>
Table 3. Merged dataset of stock prices, technical indicators, and tone score.
</b></p>'''))
display(tesla.head())
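The indicator computations mentioned above can be sketched with pandas; the window lengths and column names here are illustrative assumptions, not necessarily those used in the study:

```python
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few common technical indicators from the closing price.
    Window lengths are illustrative defaults."""
    out = df.copy()
    # Fast/slow simple moving averages for the crossover strategy.
    out['sma_fast'] = out['Close'].rolling(10).mean()
    out['sma_slow'] = out['Close'].rolling(20).mean()
    # MACD: difference of 12- and 26-period exponential moving averages.
    ema12 = out['Close'].ewm(span=12, adjust=False).mean()
    ema26 = out['Close'].ewm(span=26, adjust=False).mean()
    out['macd'] = ema12 - ema26
    # Bollinger bands: 20-period mean +/- 2 standard deviations.
    mid = out['Close'].rolling(20).mean()
    std = out['Close'].rolling(20).std()
    out['bb_upper'] = mid + 2 * std
    out['bb_lower'] = mid - 2 * std
    return out

# Toy upward-trending price series.
prices = pd.DataFrame({'Close': [float(p) for p in range(1, 61)]})
with_ind = add_indicators(prices)
```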
To run different time series pipelines simultaneously, separate notebooks were employed for each type of model. Classical machine learning models, including k-Nearest Neighbors, regression, and ensemble models, are located in the classical_ml.ipynb file. For deep learning, a Recurrent Neural Network called Long Short-Term Memory (LSTM) was used, treating the task as a sequential time series prediction problem; it is located in the LSTM.ipynb file. Lastly, Merlion, a Python library for time series intelligence, was used for time series prediction and is located in the merlion folder. Merlion provides an end-to-end machine learning framework that includes automatically loading and transforming data, building and training models, post-processing model outputs, and evaluating model performance; it is a comprehensive library that aggregates features from several existing time series analysis packages.
For each model, three situations were considered: (1) a base prediction where Close was the only feature, (2) a prediction using Close plus the Mean Daily Average Tone, and (3) a prediction using Close plus the Cumulative Sum Daily Average Tone. For each of these, three prediction windows were used: 1-day, 3-day, and 5-day. This makes 9 iterations for a single model; with 12 models and 4 company datasets, there were 432 optimal models in total.
Mean Absolute Error (MAE) was the metric used to evaluate all of the models. It compares the actual historical values with the model's predicted values and is readily interpretable because its unit is in dollars.
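A minimal illustration of the metric, with made-up prices:

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error; same unit as the input (dollars here)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Toy closing prices versus model output: errors of $2, $1, and $3.
score = mae([300.0, 305.0, 310.0], [298.0, 306.0, 313.0])  # 2.0
```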
List of the models used:
- k-Nearest Neighbors Regressor
Only the best of all 12 models was used. This was determined to be the Merlion forecaster: its DefaultForecaster model is simple to use but provides robust results, with seasonality detection and hyperparameter tuning built into the model.
The authors used this forecaster as a benchmark to predict the impact that media tone has on several company stocks. Through web scraping, the team aggregated the closing stock prices of several companies: Tesla, Facebook, Netflix, and Amazon. The aggregated media tone, in the form of its average and cumulative sum, was added as a feature for forecasting each company's closing stock price. Each dataset was split into an 80% train and 20% test set for which the forecaster generated predictions. Predictions for each company's stock were generated in 1-, 3-, and 5-day sliding windows, with the last value of each window being used as the predicted value in the trading strategy.
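The rolling evaluation described above can be sketched as follows. The naive last-observation forecaster here is only a stand-in for Merlion's DefaultForecaster; what the sketch shows is the split and the "keep the last value of each window" logic:

```python
import pandas as pd

def sliding_last_value_preds(series: pd.Series, horizon: int,
                             train_frac: float = 0.8) -> pd.Series:
    """Rolling-origin forecasts: at each test step, forecast `horizon` days
    ahead and keep only the last value of that window. A naive flat
    forecaster stands in for the actual model."""
    split = int(len(series) * train_frac)
    preds = {}
    for t in range(split, len(series) - horizon + 1):
        history = series.iloc[:t]
        window = [history.iloc[-1]] * horizon  # naive flat forecast
        preds[series.index[t + horizon - 1]] = window[-1]
    return pd.Series(preds)

# Toy closing-price series; 80% train, 3-day window.
close = pd.Series([float(v) for v in range(10, 30)])
preds = sliding_last_value_preds(close, horizon=3)
```

Swapping the flat forecast for a real model's `horizon`-step output leaves the evaluation scheme unchanged.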
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 7. Predictions for Tesla Stock using 1-, 3-, 5-day windows.
</b></center>'''))
display(_image(filename="img/tesla.png"))
For Tesla, we see a decrease in MAE using Tone in all types of prediction windows. Out of these as well, we see that the 3-day window is also the best in terms of MAE.
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 8. Predictions for Facebook Stock using 1-, 3-, 5-day windows.
</b></center>'''))
display(_image(filename="img/facebook.png"))
For Facebook stock, only the 3-day and 5-day windows had improvements by adding Tone.
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 9. Predictions for Netflix Stock using 1-, 3-, 5-day windows.
</b></center>'''))
display(_image(filename="img/netflix.png"))
We did not consider evaluating Netflix and Amazon further due to the large MAE of their stocks from the No Tone models. In general, if a stock has a high MAE from the base model alone, we expect that adding Tone may not necessarily improve it.
display(HTML('''<center style="font-size:12px;font-style:default;"><b>
Figure 10. Predictions for Amazon Stock using 1-, 3-, 5-day windows.
</b></center>'''))
display(_image(filename="img/amazon.png"))
Other results from the other types of models can be viewed in the other notebooks attached as Appendices.
We evaluate our model further by performing backtesting. We start with a classical strategy: using the crossover of fast and slow moving averages to decide when to buy or sell. To perform backtesting, we used the backtesting Python library.
df_full = pd.read_csv('data/tesla_complete.csv')
with open(r"data/day5strat.p", "rb") as input_file:
    e = pickle.load(input_file)
df_bt = df_full[-210:].copy()
df_bt['y_pred'] = e.Pred.to_list()
df_bt = df_bt.set_index('Date')
df_bt.index = pd.to_datetime(df_bt.index)
class SmaCross(Strategy):
    def init(self):
        self.close = self.data.Close
        self.ma1 = self.I(SMA, self.close, 10)
        self.ma2 = self.I(SMA, self.close, 20)

    def next(self):
        if crossover(self.ma1, self.ma2):
            self.buy()
        elif crossover(self.ma2, self.ma1):
            self.sell()
bt = Backtest(df_bt.loc['2018-06-15':'2019-04-16'], SmaCross, commission=.002,
exclusive_orders=True)
stats = bt.run()
display(stats)
We got a result of -68% using the classical method. This is understandable since our test set belongs to the bearish market of Tesla which is during 2018.
Next, we used our best machine learning model, Merlion. We chose the 3-day window since it performs best with our model and gives enough time for the difference between stock prices to develop. We do not want very volatile outputs, as we would become too sensitive to market changes.
class ModelStrategy(Strategy):
    def init(self):
        self.pred = self.data.y_pred
        self.close = self.data.Close
        self.ma1 = self.I(SMA, self.close, 10)
        self.ma2 = self.I(SMA, self.close, 20)

    def next(self):
        # Trade only when the predicted move exceeds 2% of the close.
        pct = (abs(((self.pred - self.close) / self.close) * 100) > 2)
        if ((self.pred - self.close > 0) and pct):
            self.buy()
        elif ((self.pred - self.close <= 0) and pct):
            self.sell()
bt = Backtest(df_bt.loc['2018-06-15':'2019-04-16'], ModelStrategy,
commission=.002,
exclusive_orders=True)
stats = bt.run()
display(stats)
Using our model alone, we were able to get a smaller negative return, meaning it performs better than the standalone moving-averages crossover. We then explore combining the two methods.
class ModelSmaCross(Strategy):
    def init(self):
        self.pred = self.data.y_pred
        self.close = self.data.Close
        self.ma1 = self.I(SMA, self.close, 10)
        self.ma2 = self.I(SMA, self.close, 20)

    def next(self):
        # Require both a >2% predicted move and a moving-average crossover.
        pct = (abs(((self.pred - self.close) / self.close) * 100) > 2)
        if (((self.pred - self.close > 0) and pct)
                and crossover(self.ma1, self.ma2)):
            self.buy()
        elif (((self.pred - self.close <= 0) and pct)
                and crossover(self.ma2, self.ma1)):
            self.sell()
bt = Backtest(df_bt.loc['2018-06-15':'2019-04-16'], ModelSmaCross,
commission=.002,
exclusive_orders=True)
stats = bt.run()
display(stats)
Combining the machine learning model and the moving-averages crossover, we were able to get a smaller negative return in a bearish market.
Inclusion of tone can improve prediction models for stock prices even if the keywords only refer to a certain company name. Tone can be an objective, numerical representation of how the market perceives the company, which is a reflection of systematic risk. However, this comes with a caveat: the company needs a strong public presence so that it is constantly mentioned in the GDELT database. Furthermore, time series models, specifically the Merlion forecaster, perform the best in predictions. Tesla in particular has a polarizing CEO; this kind of celebrity in management may be a better determinant of tone than one that is relatively quiet.
Feature engineering: Mean vs Cumulative Sum
The cumulative sum tends to emphasize the tone more than the mean so different companies will have different ways to represent tone. Further study can be made on feature engineering and what will maximize the value of tone for a company.
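A minimal numeric illustration of this emphasis, with a made-up tone series:

```python
import pandas as pd

# A week of mildly but persistently negative daily tone: the mean stays
# small, while the cumulative sum keeps growing in magnitude.
tone = pd.Series([-0.5] * 7)
mean_tone = tone.mean()            # -0.5, regardless of persistence
cum_tone = tone.cumsum().iloc[-1]  # -3.5, amplified by repetition
```

Sustained sentiment therefore shows up far more strongly in the cumulative-sum feature than in the daily mean.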
Prediction Windows
The general trend of results points to a 3-day window as the best choice, a possible indication of the lagged effect of media sentiment: 1 day may be too soon and 5 days too long.
Company Profiles
The authors determined two types of companies based on their base MAE. Tone can only be applied effectively to companies with smaller base errors: Tesla and Facebook had low MAE values, while Netflix and Amazon had larger ones.
For this study, only the company name was used as a keyword or search term when scanning the GDELT database for articles that may relate to the target company. This may be improved by augmenting the choice of keywords to include topics related to the industry and index the company belongs to. Other keywords, such as known icons or personalities associated with the company, could also be part of the multi-search-term list. For example, if we were to expand the keywords for Tesla, we could include words such as self-driving, renewables, Elon Musk, and the like; if we target oil companies such as Chevron, we could include petroleum, natural gas, fracking, etc.
Since the GDELT database is updated every 15 minutes, we could also explore using a dataset with hourly granularity, or one with a longer range such as weekly. Other stock information could be integrated aside from the closing price; the effect of each feature could be assessed and, if viable, added to improve predictive capability. Other models could also be explored to see if one could outperform the Merlion forecaster used in the study, and hyperparameter tuning could further improve the current model.
More preprocessing steps related to time series forecasting should also be considered to improve MAE scores.
Lastly, since not all companies are directly suitable, increasing the variety of companies tested will complete the picture and improve the profiling of companies that can be analyzed using this method. A better understanding could be built of what types of companies are associated with media sentiment changes.
[1] GDELT Project. https://www.gdeltproject.org/about.html
[2] GDELT Project (2015) "GDELT Event Codebook". http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf
[3] Chen J. (2022), "Systematic Risk", https://www.investopedia.com/terms/s/systematicrisk.asp
[4] Corporate Finance Institute. "Systematic Risk" https://corporatefinanceinstitute.com/resources/knowledge/finance/systematic-risk/